Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples
Abstract
The adoption of whole-genome sequencing in public health for the molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and determine accurately the genetic distances among a set of samples. Doing so is challenging because of both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues. We evaluated seven measures of genetic distance, estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). In empirical data (whole-genome sequence data from 18,997 Salmonella isolates), features of the genomes, the assemblies, and contamination cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to capture the distance between samples accurately, relative to distances inferred from differences in nucleotide sites. Thus, site-based distances, such as NUCmer and extended MLST, perform better, but accessing the computing resources needed to compute them may be challenging when analyzing large databases.
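To make the k-mer-based measures named above concrete, here is a minimal Python sketch of how such distances can be computed from k-mer profiles. It is an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, k-mers are not canonicalized, counts are not normalized, and the Jaccard index is computed exactly rather than estimated from MinHash sketches as the Mash tool does. The Mash distance uses the published formula D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (assumption: no canonical k-mers)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_index(p, q):
    """Shared distinct k-mers divided by all distinct k-mers in either profile."""
    a, b = set(p), set(q)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_distance(p, q):
    return 1.0 - jaccard_index(p, q)


def manhattan_distance(p, q):
    """Sum of absolute count differences; a missing k-mer counts as zero."""
    return sum(abs(p[m] - q[m]) for m in set(p) | set(q))


def euclidean_distance(p, q):
    """Square root of summed squared count differences."""
    return math.sqrt(sum((p[m] - q[m]) ** 2 for m in set(p) | set(q)))


def mash_distance(j, k):
    """Mash distance from a Jaccard index j, capped to the range [0, 1]."""
    if j <= 0.0:
        return 1.0
    return max(0.0, min(1.0, -math.log(2.0 * j / (1.0 + j)) / k))


if __name__ == "__main__":
    k = 3
    full = kmer_profile("ACGTACGT", k)
    variant = kmer_profile("ACGTTCGT", k)   # one substitution
    partial = kmer_profile("ACGTA", k)      # same sequence, truncated
    for name, other in (("variant", variant), ("partial", partial)):
        j = jaccard_index(full, other)
        print(name, jaccard_distance(full, other),
              manhattan_distance(full, other),
              euclidean_distance(full, other),
              mash_distance(j, k))
```

In this toy example the truncated sequence has no nucleotide differences from the full one in the region they share, yet every k-mer measure reports a nonzero distance, because k-mers that are simply absent are counted as differences. This is the sense in which such profiles treat absent data as informative.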